Journal of Medical Imaging
● SPIE-Intl Soc Optical Eng
Preprints posted in the last 90 days, ranked by how well they match Journal of Medical Imaging's content profile, based on 11 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit.
Killekar, A.; Shanbhag, A.; Miller, R. J.; Dey, D.; Bourque, J.; Phillips, L.; Chareonthaitawee, P.; Slomka, P.
Background: Previous studies evaluated large language model (LLM) performance on the American Society of Nuclear Cardiology (ASNC) Board Preparation Exam. Without domain-specific context, the best model (GPT-4o) achieved 63.1%, below the estimated 65% passing threshold and the 78% mean score of human fellows-in-training (FITs). Providing textbook context improved GPT-4o to 73.8% on text-only questions, which still fell short of human trainees. Whether next-generation LLMs with retrieval-augmented generation (RAG) can close this gap is unknown. Methods: Claude Opus 4.7 and GPT-5.5 were each administered all 168 questions (141 text-only, 27 image-based) from the 2023 ASNC Board Preparation Exam across 5 iterations, using RAG with a nuclear cardiology textbook, companion atlas, and ASNC clinical guidelines. Claude used local FAISS-based semantic retrieval; GPT-5.5 used Azure's cloud-hosted vector store. Performance was compared to prior LLM results and 13 human FITs. Results: Across 5 iterations, Claude Opus 4.7 achieved a mean accuracy of 86.3% ± 1.4% (text 88.8%, image 73.3%). GPT-5.5 achieved 86.7% ± 2.2% (text 88.5%, image 77.0%) but refused a mean of 12.2 questions (7.3%) per iteration due to safety filters. Both models surpassed the human FIT mean (78.0%) and the estimated passing threshold. Compared to GPT-4o without context (63.1%), this represents a 23-percentage-point improvement in 18 months. Conclusion: Next-generation LLMs with RAG now surpass average human trainee performance on nuclear cardiology board preparation questions, suggesting significant potential as educational tools and knowledge-reference aids in cardiovascular imaging. Condensed Abstract: Across 5 iterations each, Claude Opus 4.7 and GPT-5.5 with retrieval-augmented generation achieved mean accuracies of 86.3% and 86.7% on the 2023 ASNC Board Preparation Exam (168 questions), both surpassing the mean human fellow-in-training score of 78%.
GPT-5.5 refused a mean of 12.2 questions (7.3%) per iteration due to safety filters. These results represent a 23-percentage-point improvement over the best prior LLM without context (63.1%), demonstrating that RAG-enhanced LLMs have reached human-level proficiency in nuclear cardiology knowledge. Graphical Abstract: Overview of the three-study research arc evaluating LLM performance on the 2023 ASNC Board Preparation Exam. Study 1 (2024) tested four LLMs without context (best: GPT-4o, 63.1%). Study 2 (2025) added textbook context to GPT-4o (73.8%). Study 3 (2026, current) evaluated Claude Opus 4.7 and GPT-5.5 with retrieval-augmented generation across 5 iterations each (mean 86.3% and 86.7%, respectively), both surpassing the human fellow-in-training mean of 78%. Right panel shows the performance scale with key thresholds.
Sivakumar, E.; Anand, A.
Computer vision and deep learning techniques, including convolutional neural networks (CNNs) and transformers, have increased the performance of medical image classification systems. However, training deep learning models on medical images is challenging because it requires a substantial amount of annotated data. In this paper, we implement data augmentation strategies to tackle dataset imbalance in the VinDr-SpineXR dataset, which contains fewer abnormal spine X-ray images than normal ones. Geometric transformations and synthetic image generation using Generative Adversarial Networks (GANs) are explored and applied to the abnormal classes of the dataset, and classifier performance is validated using VGG-16 and InceptionNet to identify the most effective augmentation technique. Additionally, we introduce a hybrid augmentation technique that addresses class imbalance, reduces computational overhead relative to a GAN-only approach, and achieves approximately 99% validation accuracy with both classifiers across all three case studies.
Wang, S.; Ayubcha, C.; Hua, Y.; Beam, A.
Background: Developing generalizable neuroimaging models is often hindered by limited labeled data, which has led to increased interest in unsupervised inverse learning. Existing approaches often neglect geometric principles and struggle with diverse pathologies. We propose a symmetry-informed inverse learning foundation model to address these shortcomings for robust and efficient anomaly detection in brain MRI. Methods: Our framework employs a reconstruction-to-embedding pipeline, trained exclusively on healthy brain MRI slices. A 2D U-Net uses a novel, symmetry-aware masking strategy to reconstruct a disorder-free slice. Difference maps are embedded into a 1024-dimensional latent space via a Beta-VAE. Anomaly scoring is performed using Mahalanobis distance. We evaluated generalization by fine-tuning on two external datasets: a lesion dataset, BraTS Africa (SSA), and the ADNI-derived Alzheimer's disease cohort (Alz). Results: On the source metastasis (Mets) dataset, the framework achieved high performance (AB1+MSE: 99.28% accuracy, 99.79% sensitivity). Generalization to the external lesion dataset (SSA) was robust, with the Symmetry ROC configuration achieving 91.93% accuracy. Transfer to the Alzheimer's dataset (Alz) was more challenging, achieving a peak accuracy of 70.54% with a high false-positive rate, suggesting difficulty in separating subtle, diffuse changes. Conclusion: The symmetry-informed inverse learning framework establishes a robust foundation model for neuroimaging, showing strong performance for focal lesions and successful generalization under domain shift. Limitations in diffuse neurodegeneration underscore the necessity for richer representations and multimodal integration to improve future foundation models. Summary Statement: A symmetry-informed inverse learning framework trained on normal brain MRI achieved high accuracy for detecting focal lesions and demonstrated strong generalization across external datasets under domain shift.
Key Points:
• A symmetry-informed disorder-free reconstruction framework trained only on normal brain MRI achieved 99.28% accuracy and 99.79% sensitivity for metastasis detection on the BrainMetShare dataset, demonstrating non-inferior performance compared with all but one strategy while offering improved computational efficiency.
• The model generalized effectively to an external tumor dataset (BraTS SSA), achieving up to 91.93% accuracy using receiver operating characteristic-optimized thresholding with minimal fine-tuning.
• Embedding-based anomaly detection using Mahalanobis distance enabled consistent separation between normal and abnormal slices, supporting robust and interpretable anomaly detection across datasets.
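The Mahalanobis-distance scoring mentioned in this abstract can be illustrated with a small sketch. This is not the authors' implementation: their embeddings are 1024-dimensional, whereas this toy version works in 2 dimensions so the covariance inverse can be written out by hand. The function names `fit_gaussian_2d` and `mahalanobis_2d` and the sample points are assumptions made for illustration only.

```python
import math

def fit_gaussian_2d(points):
    """Estimate mean and sample covariance of 2-D 'normal' embeddings (toy stand-in)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of x from the fitted distribution."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form d^T * inverse(cov) * d, using the closed-form 2x2 inverse.
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)

# Hypothetical embeddings of healthy slices, then scoring of new points.
normal = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mean, cov = fit_gaussian_2d(normal)
score_typical = mahalanobis_2d((1.0, 1.0), mean, cov)   # at the centre
score_outlier = mahalanobis_2d((3.0, 1.0), mean, cov)   # farther away
```

A slice would then be flagged as anomalous when its score exceeds a chosen threshold; how that threshold is set (for example, by ROC analysis) is a separate design choice.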
Namvar, A.; Shan, B.; Hoff, B.; Labaki, W. W.; Murray, S.; Bell, A. J.; Galban, S.; Kazerooni, E. A.; Martinez, F. J.; Hatt, C. R.; Han, M. K.; Galban, C. J.; Ram, S.
Purpose: To develop an interpretable feature-based Deep Parametric Response Mapping (PRMD) method that combines wavelet scattering convolution networks and machine learning to spatially detect and quantify functional small airways disease (fSAD) and emphysema on paired inspiratory-expiratory CT scans, with enhanced noise robustness. Materials and Methods: In this retrospective analysis of prospectively acquired data (2007-2017), we developed and validated a deep learning-based PRM approach using paired CT scans from 8,972 tobacco-exposed COPDGene participants (≥10 pack-years; mean age 60.1 ± 8.8 years; 46.5% women), including controls with normal spirometry (n = 3,872), PRISm (n = 1,089), and GOLD 1-4 COPD (n = 4,011). Data were stratified into training, validation, and testing sets (24:6:70). PRMD extracts translation-invariant image features using a wavelet scattering network and applies a subspace learning classifier to classify voxels as emphysema or non-emphysematous air trapping (fSAD). PRMD was compared with conventional density-based PRM for voxel-wise agreement, correlation with pulmonary function, robustness to noise, and sensitivity to misregistration using Pearson correlation, Bland-Altman analysis, and paired t tests. Results: PRMD achieved 95% voxel-wise agreement with standard PRM (r = 0.98) while demonstrating significantly greater robustness under noise. PRMD showed stronger correlations with FEV (emphysema: r = -0.54; fSAD: r = -0.51; P < 0.0001) than standard PRM (r = -0.42 for both; P < 0.0001). Under simulated high-noise conditions, standard PRM overestimated disease by approximately 15%, whereas PRMD limited error to < 5% (P < 0.001). Conclusion: PRMD provides an interpretable, feature-driven, and noise-resilient alternative to traditional PRM for emphysema and fSAD classification, enhancing the reliability of CT-based COPD phenotyping for multi-center studies and low-dose imaging applications.
Key Points:
• This study introduces combined wavelet scattering and subspace learning for medical image segmentation, enabling accurate, interpretable voxel-level classification of emphysema and functional small airways disease on paired CT scans.
• The proposed Deep Parametric Response Mapping method demonstrated 95% voxel-wise agreement with standard Parametric Response Mapping and stronger correlations with spirometric measures, enhancing the clinical relevance of CT-based phenotyping for Chronic Obstructive Pulmonary Disease.
• Deep Parametric Response Mapping significantly improved robustness to image noise, reducing overestimation of emphysema and functional small airways disease from approximately 15% to <5% (P < 0.001), and benefits from reduced data requirements due to the fixed, mathematically defined filters used in wavelet scattering.
Summary Statement: Deep Parametric Response Mapping improves the accuracy and noise robustness of CT-based classification of emphysema and functional small airways disease using feature-based representations, enhancing the reliability of COPD phenotyping.
Navarro-Gonzalez, R.; Aja-Fernandez, S.; Planchuelo-Gomez, A.; de Luis-Garcia, R.
Foundation models (FMs) for brain magnetic resonance imaging (MRI) are increasingly adopted as pretrained backbones for clinical tasks such as brain age prediction, disease classification, and anomaly detection. However, if FM embeddings (internal representations) shift systematically across MRI scanners, downstream analyses built on them may reflect acquisition hardware rather than biology. No study has yet quantified this cross-scanner reproducibility. Here, we assess the cross-scanner reliability of brain MRI FM embeddings and investigate which design factors (pretraining strategy, network architecture, embedding dimensionality, and pretraining dataset scale) best explain the observed differences. Using the ON-Harmony travelling-heads dataset (20 participants, eight scanners, three vendors), we evaluate the embeddings of five architecturally diverse FMs and a FreeSurfer morphometric baseline via within- and between-scanner intraclass correlation coefficient (ICC), variance decomposition, and scanner fingerprinting. Reliability spanned the full spectrum: biology-guided models achieved good-to-excellent cross-scanner ICC (AnatCL: 0.970 [95% confidence interval (CI): 0.94, 0.98]; y-Aware: 0.809 [0.63, 0.88]), matching or surpassing FreeSurfer (0.926 [0.83, 0.96]), whereas purely self-supervised models fell below the threshold for poor reliability (BrainIAC: 0.453, BrainSegFounder: 0.307, 3D-Neuro-SimCLR: 0.247), with 23-58% of embedding variance attributable to scanner identity. The strongest correlate of cross-scanner reliability among the models evaluated was pretraining strategy: incorporating biological metadata (cortical morphometrics, age) into the contrastive objective produced scanner-robust embeddings, whereas architecture, dimensionality, and dataset scale did not predict reliability.
Alamoudi, N.; Valdes Hernandez, M. d. C.; Seth, S.; Jin, B.; Sakka, E.; Arteaga-Reyes, C.; Mair, G.; Jaime-Garcia, D.; Cheng, Y.; Jochems, A. C. C.; Wardlaw, J. M.; Bernabeu Llinares, M. O.
Purpose: White matter hyperintensities are a key imaging marker of vascular pathology, defined on brain magnetic resonance imaging (MRI) and typically manifesting on non-contrast computed tomography (CT) as subtle white matter hypoattenuation (WMH). Accurately segmenting WMH in CT scans remains challenging due to their low contrast with the surrounding tissue. This work presents an end-to-end framework for WMH segmentation in CT scans and validates the design choices in each step of the processing pipeline. We leverage a state-of-the-art deep-learning method combined with manually annotated and pseudo-labelled datasets from paired CT-MRI scans from different clinical scanners to deliver reliable outcomes. Approach: Our framework includes DICOM data curation, sequence selection, and automatic label generation as preparation steps. Preprocessing includes z-score intensity normalisation, skull stripping, CT windowing, and two-step CT-MRI registration to accurately transfer MRI-derived labels into the CT space. Further processing involves a 3D nnU-Net initially trained on CT images with aligned, manually derived MRI-based WMH labels (n=91) and fine-tuned with two additional pseudo-labelled datasets (n=191). Findings: CT-based WMH volumes showed a near-perfect correlation with ground-truth MRI WMH volumes (r = 0.98), with a systematic overestimation (mean difference = 2.40 mL; 95% limits of agreement: -8.31 to 13.11 mL) that may be adjustable in downstream tasks. This overestimation reflected challenges in the precise delineation of small WMH lesions and confounding from other imaging markers of brain disease. Across the evaluated cohort, ground-truth WMH volumes ranged from 1.02 to 149.34 mL. The best-performing configuration achieved a mean absolute error below 3 mL, corresponding to approximately 17% of the mean WMH volume, and a mean Dice similarity coefficient of 0.57. Segmentation accuracy decreased in the presence of stroke lesions.
Models trained on single-pathology datasets, as well as approaches relying on template-based spatial normalisation, did not achieve satisfactory performance despite using the same backbone network configuration. Conclusion: Using a multi-centre dataset and a multi-modal approach with expert-annotated data combined with pseudo-labelled data for training can substantially narrow the performance gap between CT- and MRI-based WMH segmentation. The proposed framework provides a generalisable solution that underscores the practical viability of CT for evaluating WMH burden in clinical and research scenarios, particularly where MRI is unavailable or contraindicated, thereby broadening access to small-vessel disease assessment.
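The mean difference and 95% limits of agreement quoted in this abstract are the standard Bland-Altman quantities. The sketch below shows how they are conventionally computed, assuming the usual definition (bias ± 1.96 × SD of the paired differences); the function name `bland_altman` and the volume values are hypothetical and are not taken from the study.

```python
import statistics

def bland_altman(method_a, method_b):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)   # systematic over- or underestimation
    sd = statistics.stdev(diffs)    # sample SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical CT-derived vs MRI-derived volumes in mL.
ct_ml = [12.0, 30.0, 55.0, 8.0]
mri_ml = [10.0, 27.0, 50.0, 8.0]
bias, lower, upper = bland_altman(ct_ml, mri_ml)
```

Under this definition the limits are symmetric about the bias, which is consistent with the reported figures: the midpoint of -8.31 and 13.11 mL is 2.40 mL.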
Rudi, G.; Vula, F.; Bicaku, A.; Dedushi, K.; Ahmetgjekaj, I.
Computed tomography is the largest contributor to population radiation dose from medical imaging, yet no diagnostic reference levels (DRLs) have been published from Kosovo or the Western Balkans. This retrospective audit analyzed all CT examinations performed on a 128-slice scanner at the University Clinical Centre of Kosovo between January and March 2026. After exclusions, 1,535 acquisitions from 1,092 patients across nine examination categories were analyzed. Local DRLs were defined as the 75th percentile and compared against German (BfS 2022) and Turkish (Kahraman et al., 2024) reference values. Head CT (n = 590) demonstrated CTDIvol 4.7% below the BfS DRL yet scan length 98.5% above the orientation value (median 25.8 vs 13 cm). Abdomen-pelvis CTDIvol matched the BfS reference while scan length exceeded it by 28%. Coronary CTA showed CTDIvol 377% above the reference value, consistent with retrospective ECG gating. Excess scan length, not CTDIvol, is the major driver of elevated dose at this institution. The identified excesses are correctable through technologist landmarking training, protocol review, and enabling iterative reconstruction.
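The two calculations this audit relies on, a 75th-percentile local DRL and a percent deviation from a reference value, can be sketched as follows. The function names and the sample dose list are assumptions for illustration; the interpolation rule the authors used for the percentile is not stated, so the inclusive (linear interpolation) method is chosen here as one common option.

```python
import statistics

def local_drl(dose_values):
    """75th percentile of a dose distribution (inclusive linear interpolation)."""
    return statistics.quantiles(dose_values, n=4, method="inclusive")[2]

def percent_deviation(local_value, reference_value):
    """Signed deviation of a local value from a reference, in percent."""
    return (local_value - reference_value) / reference_value * 100.0

# Hypothetical per-examination dose values for one category.
example_doses = [1.0, 2.0, 3.0, 4.0, 5.0]
drl = local_drl(example_doses)

# Scan-length figures quoted in the abstract: median 25.8 cm vs 13 cm.
scan_length_excess = percent_deviation(25.8, 13.0)
```

With the scan-length values from the abstract, the deviation rounds to the reported 98.5%.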
Tejaswi, A.; Fyrdahl, A.; Sigfridsson, A.
Background: Cardiovascular magnetic resonance (CMR) quantification of left ventricular (LV) volumes and ejection fraction (EF) typically involves manual segmentation of many short axis (SAx) and long axis (LAx) slices of the left ventricle. The scan time and the number of breath holds are proportional to the number of slices. We aimed to evaluate a geometric model of the left ventricle that could enable planimetry from a reduced number of slices. We sought to determine whether acceptable accuracy was retained for evaluating the End Diastolic Volume (EDV), End Systolic Volume (ESV), Stroke Volume (SV), and EF to provide a rapid and reliable clinical alternative. Methods: A cohort of 342 patients, median age 54 (40-65) years, with full-stack CMR examinations was used. Nine geometrical combinations were evaluated by retrospectively decimating the full-stack acquisition: 3, 4, or 5 SAx slices combined with one of three LAx orientations (2-chamber, 3-chamber, or 4-chamber). LV volumes were calculated as a sum of trapezoidal approximations for apical and mid-cavity slices and a generalized prismoidal model at the base. The accuracy of the volume calculations was quantified against the full-stack reference for the EDV, ESV, SV, and EF using concordance correlation coefficient (CCC), two-way repeated measures ANOVA, pairwise tests, and Bayes factor log10(BF10) analysis. Results: The choice of LAx view was the most influential driver of accuracy (g2 = 0.104, for EDV), approximately 50 times more impactful than the number of SAx slices (g2 = 0.002, for EDV). Volumes calculated using the combination of the 2-chamber LAx view and 5 SAx slices had the highest concordance with the full stack (CCC>0.90). While the estimated absolute volumes displayed a systematic negative bias, EF and SV remained highly robust due to bias cancellation.
For a 2ch + 5 SAx protocol, EF bias was just 0.83% (LoA: -6.18 to 7.84%), with a minimum detectable change (MDC) of 7.01%, compared to 8.7% reported for expert human readers, suggesting strong concordance. Bayesian paired-samples t-tests yielded log10(BF10) = 6.42 in favor of 5 SAx over 3 SAx, constituting decisive evidence on the Jeffreys scale. The bias and limits of agreement (LoA) for stroke volume and ejection fraction were lower than the scan-rescan reproducibility reported in the literature. Conclusion: This reduced-slice geometric model allows for a reduced number of breath holds compared to a conventional full-stack CMR acquisition and provides acceptable accuracy, with bias less than scan-rescan variability.
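The trapezoidal slice summation and the bias cancellation described in this abstract can be made concrete with a short sketch. It covers only the trapezoidal part; the generalized prismoidal model used at the base is not reproduced. The function names, slice areas, spacing, and volumes are hypothetical values chosen for illustration.

```python
def stack_volume_ml(areas_cm2, spacing_cm):
    """Trapezoidal volume between consecutive slice areas (cm^2 * cm = mL)."""
    if len(areas_cm2) < 2:
        raise ValueError("need at least two slices")
    pairs = zip(areas_cm2, areas_cm2[1:])
    return sum((a0 + a1) / 2.0 * spacing_cm for a0, a1 in pairs)

def stroke_volume(edv_ml, esv_ml):
    """SV = EDV - ESV."""
    return edv_ml - esv_ml

def ejection_fraction(edv_ml, esv_ml):
    """EF in percent: (EDV - ESV) / EDV * 100."""
    return (edv_ml - esv_ml) / edv_ml * 100.0

# Hypothetical cavity areas at three slice positions, 1 cm apart.
volume = stack_volume_ml([0.0, 10.0, 20.0], 1.0)

# A shared 10% proportional underestimate of both volumes leaves EF unchanged.
ef_reference = ejection_fraction(150.0, 60.0)
ef_biased = ejection_fraction(0.9 * 150.0, 0.9 * 60.0)

# A shared additive offset of -10 mL on both volumes leaves SV unchanged.
sv_reference = stroke_volume(150.0, 60.0)
sv_offset = stroke_volume(150.0 - 10.0, 60.0 - 10.0)
```

This is the arithmetic reason a systematic negative bias in absolute volumes need not carry over to EF or SV, provided the bias affects EDV and ESV in a similar way.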
Takita, H.; Mitsuyama, Y.; Walston, S. L.; Saito, K.; Sugibayashi, T.; Okamoto, M.; Suh, C. H.; Ueda, D.
Purpose: Medical imaging typically generates 12- to 16-bit formats, yet conversion to 8-bit is often required. While deep learning has been widely explored in medical imaging, the influence of image bit depth on model performance is not fully understood. This study evaluates the impact of conversion from 16-bit to 8-bit for sex, age, and obesity classification using deep learning. Materials and Methods: In this retrospective, multi-institutional study, we analyzed 100,002 chest radiographs from 48,047 participants across three institutions. Three convolutional neural network architectures (ResNet52, EfficientNetB2, and ConvNeXtSmall) were trained on both 16-bit and 8-bit versions of the images. Model performance was evaluated using internal test datasets, randomly split multiple times, and an external test dataset. Statistical analysis included paired comparisons of area under the receiver operating characteristic curve (AUC-ROC) values, with Bonferroni correction for multiple comparisons. Results: Across all architectures and classification tasks, differences between 16-bit and 8-bit model performance were minimal (mean differences ranging from -0.218% to 0.184%). Statistical analyses revealed no significant differences in AUC-ROC values between bit depths for any model-task combination (all p-values > 0.05 after Bonferroni correction). Effect sizes were small to moderate (Cohen's d ranging from -0.415 to 0.391). Conclusion: Reducing image bit depth from 16-bit to 8-bit does not significantly impact the performance of deep learning models in chest radiograph analysis. These findings suggest that 8-bit images can be used for deep learning applications in medical imaging without compromising model performance, potentially allowing for more efficient data storage and processing.
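For readers unfamiliar with bit-depth reduction, the sketch below shows one common way to map 16-bit pixel values to 8-bit: a linear rescale with clipping. The abstract does not say which conversion the authors applied, so this is an assumed, generic approach; the function name `to_8bit` and its `lo`/`hi` window parameters are hypothetical.

```python
def to_8bit(pixels_16, lo=0, hi=65535):
    """Linearly rescale values from [lo, hi] to [0, 255], clipping out-of-range input."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scale = 255.0 / (hi - lo)
    out = []
    for value in pixels_16:
        clipped = min(max(value, lo), hi)            # clamp to the window
        out.append(int(round((clipped - lo) * scale)))
    return out

# Endpoints map to the ends of the 8-bit range; the full 16-bit range
# collapses onto 256 distinct output levels.
endpoints = to_8bit([0, 65535])
levels = len(set(to_8bit(range(65536))))
```

Each 8-bit level therefore stands in for roughly 257 neighbouring 16-bit values, which is the information loss whose effect on classification the study measures.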
Barbero-Mota, M.; Annio, G.; Rucher, G.; Martorell, J.
Myocardium biomechanics are a biomarker for multiple cardiac pathologies. However, the rapid and complex motion of the heart hampers accurate measurement of tissue stiffness. Current in vivo methods for evaluating myocardial mechanical health are either highly invasive or provide only a global surrogate of heart function, as they suffer from poor spatiotemporal resolution. We propose a new in vivo technique, transient magnetic resonance elastography (tMRE), to assess dynamic cardiac biomechanics. tMRE is able to quantify local shear wave speed as a proxy for myocardial stiffness at user-defined times within the cardiac cycle. We report proof-of-concept results in which we probe the septum of 4 healthy rat specimens at 3 physiologically distinct cardiac phases. We provide apparent speed measurements for early systole, mid-late systole, and early diastole that match the values expected from the physiological mechanics of the cardiac cycle. We correct for non-negligible geometrical biases using literature results and report true stiffness values where possible. Finally, we validate tMRE in phantom experiments.
de Boer, S.; Häntze, H.; Ziegelmayer, S.; van Ginneken, B.; Prokop, M.; Bressem, K. K.; Hering, A.
Background: Medical imaging, especially computed tomography and magnetic resonance imaging, is essential in clinical care of patients with renal cell carcinoma (RCC). Artificial intelligence (AI) research into computer-aided diagnosis, staging, and treatment planning needs curated and annotated datasets. Across the literature, The Cancer Genome Atlas (TCGA) datasets are widely used for model training and validation. However, re-annotation is often necessary due to limited access to public annotations, raising entry barriers and hindering comparison with prior work. Methods: We screened 1915 CT scans from three TCGA-RCC databases and employed a segmentation model to annotate kidney lesions. After a metadata-based exclusion step, we hosted a reader study with all papillary (n=56), chromophobe (n=27), and 200 randomly selected clear cell RCC cases. Two students quality-checked and corrected the data and annotated tumors and cysts. Uncertain cases were checked by a board-certified radiologist. Results: After data exclusion and quality control, a total of 142 annotated CT scans from 101 patients (26 female, 75 male, mean age 56 years) remained. This includes 95 CTs with clear cell RCC, 29 with papillary RCC, and 18 with chromophobe RCC. Images and voxel-level annotations of kidneys and lesions are open sourced at https://zenodo.org/records/19630298. Conclusion: By making the annotations open-source, we encourage accessible and reproducible AI research for renal cell carcinoma. We invite other researchers who have previously annotated any of these cohorts to share their annotations.
Fernandez Topham, J.; Guerrero Hurtado, M.; del Alamo, J. C.; Bermejo, J.; Martinez Legazpi, P.
Background: Pressure-volume (PV) loop analysis remains the gold standard for assessing the intrinsic global diastolic properties of the left ventricle (LV). Traditional fitting techniques rely on local, phase-constrained fittings and are limited by their sensitivity to noise, landmark selection, violation of assumptions, and non-convergence. Objective: To develop and validate DIA-PINN, a physics-informed neural network (PINN) framework capable of calculating intrinsic diastolic properties of the LV from measured instantaneous PV data, combining mechanistic interpretability with machine learning flexibility. Methods: Instantaneous LV diastolic pressure was modeled as the sum of 1) time-dependent relaxation-related pressure and 2) volume-dependent recoil and stiffness-related pressures. DIA-PINN was trained using time, LV pressure, and volume as inputs, enforcing data fidelity, model consistency, and physiological plausibility within the loss function. Performance was evaluated in 4,000 Monte Carlo simulations of LV PV-loops, and in clinical data from 59 patients who underwent catheterization (39 with heart failure and normal ejection fraction and 20 controls). DIA-PINN-derived indices were compared to those obtained from a previously validated global optimization method (GOM). Results: On the simulation data, DIA-PINN accurately recovered all constitutive indices (intraclass correlation coefficients near unity) and improved on GOM performance. On the clinical data, diastolic indices derived using DIA-PINN strongly correlated with GOM estimates (R>0.90, p<0.001) and were insensitive to initialization. DIA-PINN performed best under vena cava occlusion, as varying preload improved parameter identifiability. Conclusions: When applied to instantaneous pressure-volume data, a generalizable PINN framework, DIA-PINN, provides an improved method for assessing global intrinsic diastolic properties of cardiac chambers.
New & Noteworthy: Our work introduces DIA-PINN, a physics-informed neural network framework to process instantaneous ventricular pressure-volume data, solving a mechanistic model of diastole with machine learning techniques. Compared to current conventional or optimization-based approaches, the PINN provides the most reliable estimates of diastolic stiffness, relaxation, and elastic recoil, and is insensitive to initialization. By embedding physiological constraints into network training, this approach achieves robust, interpretable, and clinically applicable quantification of gold-standard metrics of intrinsic global diastolic chamber properties.
Yang, K.; Shi, P.; Huang, H.; Musio, F.; Baazaoui, H.; Aydin, O. U.; Hilbert, A.; Hamadache, R. E.; Yalcin, C.; Zhang, M.; Falcetta, D.; de la Rosa, E.; Shit, S.; Prabhakar, C.; Wittmann, B.; Rokuss, M. R.; Kirchhoff, Y.; Al-Maskari, R.; Hoeher, L.; Juchler, N.; Casamitjana, A.; Cleary, J.; Schmick, A.; Baumgartner, P.; Deseoe, J.; Vandans, O.; Lee, D.; Oh, K.; LaBella, D.; Mazher, M.; Niederer, S. A.; Qayyum, A.; Liu, Y.; Chen, J.; Kim, W.; Asawalertsak, N.; Kim, M.; Shin, D.; Park, S.-H.; Kikuchi, S.; Zhang, Y.; Liu, J.; Cui, Y.; Qiu, Y.; Verschuur, A.; Zhang, J.; van der Schaaf, I.; Su, R.;
We present the TopBrain 2025 Challenge, the first benchmark for fine-grained multiclass segmentation of the whole brain vasculature in both computed tomography angiography (CTA) and magnetic resonance angiography (MRA). Building on the TopCoW challenge, TopBrain scales vessel annotation from the Circle of Willis to the entire brain, introducing a dataset of 90 annotated volumes across 48 landmark vessel classes spanning arterial and venous systems, of which 50 training volumes are publicly released. Vessel definitions were consolidated from established neuroanatomical references into a unified annotation scheme, and vessel caliber measurements along the centerline are reported for the first time across the whole brain vascular anatomy. To address the unique challenges of multiclass brain vessel segmentation, we propose an evaluation framework that accounts for detection in segmentation performance, assesses anatomical plausibility, and introduces novel contamination metrics that characterize inter-class prediction errors. Fifteen teams from over 220 registered participants submitted algorithms to the benchmark. The top-performing teams built on nnUNet with principled system design choices, achieving around 80% Dice scores, near-zero invalid neighbor counts, over 60% F1 scores for side-road vessels, and below 18% foreground contamination ratio. Larger vessels are easier to segment, while smaller and more complex vessels remain the true bottleneck. The annotated datasets and podium-finish algorithms are made publicly available on Zenodo.
Kim, T.; Baker, T.; Burris, N.; Figueroa, A.
Aortic stiffness is both heterogeneous and anisotropic. Current non-invasive methods to estimate aortic stiffness are limited to characterizing the aortic tissue as isotropic due to the lack of techniques for extracting multi-axial strain from 3D dynamic images. Vascular deformation mapping (VDM) is a nonrigid image registration technique which has thus far been applied to map aortic growth using longitudinal imaging. In this study, we propose to use VDM to assess 3D aortic deformation by mapping diastolic and systolic images. During the image registration process, penalty parameters are employed to fine-tune image alignment and penalize non-physiological deformations. These penalty parameters must be calibrated to ensure that VDM successfully reproduces multi-axial aortic motion patterns in health and disease. In this paper, we developed a calibration pipeline for these parameters using synthetic data. A rotation-free shell model was used to generate physics-based synthetic data on aortic motion incorporating patient-specific geometries, root motion, and blood pressure from a cohort of 14 subjects (healthy, Marfan syndrome, and thoracic aortic aneurysm). An error metric was defined to quantify the quality of the VDM results. Furthermore, a k-means clustering technique was used to categorize the subjects into three clusters based on ascending aortic motion. Optimal penalty parameters were identified for each of the three clusters. The results indicated that patient clusters with smaller aortic root motion required larger rigidity penalty values. The calibrated parameters successfully reduced errors in 3D displacement and multi-axial stretch compared to un-optimized VDM predictions, enhancing the accuracy of capturing aortic deformation from dynamic images. Among the different aortic regions, the ascending thoracic aorta exhibited the largest error reduction.
Ludwig, K. D.; Hatt, C. R.; Keith, L.; Matyga, A. W.; Te, H. S.; Landeras, L.; Chelala, L.; Patel, A. R.; Chung, J. H.
Objective: Coronary artery calcification (CAC) assessment for cardiovascular risk stratification is traditionally achieved using ECG-gated computed tomography (CT). Automated deep-learning (DL) algorithms may streamline opportunistic CAC detection and scoring, particularly on non-gated CT scans. This study evaluated the performance of a fully automated DL-based CAC scoring algorithm ("DL-CAC") against expert human scoring. Methods: The algorithm was trained on 1,260 chest CT scans from multiple databases to automatically identify coronary calcium, calculate Agatston scores, and assign a cardiovascular disease (CVD) risk classification. Performance was assessed on a holdout dataset (n=500) comprising ECG-gated calcium scoring CT scans and lung cancer screening non-gated chest CTs, as well as in an external, independent CT dataset (n=129) from liver transplant candidates. Agreement with expert scoring was assessed using intraclass correlation coefficient (ICC) for Agatston scores and Cohen's κ for CVD risk classification. Results: The algorithm demonstrated high agreement with expert scoring in the pooled calcium scoring and lung cancer screening cohorts, with an ICC of 0.947 for Agatston scores and κ of 0.936 for CVD risk classification. For liver transplant candidates, the algorithm exhibited substantial agreement with expert scoring of non-gated CT scans (κ=0.79) and a sensitivity of 90.4% and specificity of 96.4% in high-risk cases. Conclusion: These findings suggest that DL-based CAC scoring on non-gated CT scans may be a feasible alternative to traditional methods and could support opportunistic cardiovascular risk assessment in routine imaging. Further validation is warranted to assess clinical integration in broader practice settings.
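Cohen's κ, used in this abstract to compare algorithm and expert risk categories, corrects raw agreement for the agreement expected by chance. The sketch below computes the unweighted form; whether the study used a weighted variant is not stated, so this is an assumption. The function name and the category labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equally long lists of category labels."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("need two non-empty lists of equal length")
    # Observed agreement: fraction of cases where both raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal category frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)

# Hypothetical risk categories assigned by an algorithm and an expert.
expert = ["low", "low", "high", "high"]
kappa_perfect = cohens_kappa(expert, ["low", "low", "high", "high"])
kappa_chance = cohens_kappa(expert, ["low", "high", "high", "low"])
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which gives context for the reported values of 0.936 and 0.79.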
Matthews, G. A.; Godson, L.; McGenity, C.; Bansal, D.; Treanor, D.
Background: There is increasing momentum behind the clinical implementation of AI-based software for image analysis in digital pathology. As regulations, standards, and national approaches to the clinical use of AI continue to develop, the marketplace of AI products is expanding and evolving - presenting pathologists with a multitude of devices that offer the potential to improve pathology services. Methods: To maintain pace with this changing AI device landscape, we conducted a comprehensive search for, and analysis of, commercial AI products for image analysis in digital pathology. This included CE-marked and Research Use Only (RUO) products using images with histological stains (e.g., H&E) or immunohistochemical (IHC) labelling. Product information and published clinical validation studies were assessed, to understand the quality of supporting evidence on available products, and product details were compiled into a public register: https://osf.io/gb84r/overview. Results: In total, we identified and assessed 90 CE-marked and 227 RUO AI products. We found that AI products for cancer detection in prostate and breast pathology comprised a substantial portion of the marketplace for H&E image analysis, while IHC products were almost exclusively for use in breast cancer. Clinical validation studies on these products have steadily increased; however, we found that published studies were only available for just over half of H&E products and just over a quarter of IHC products. For CE-marked products, the dataset quality and diversity for AI model performance validation were highly variable, and particularly limited for IHC products. Furthermore, only a limited number of products included studies that assessed measures of clinical utility.
Conclusion: As clinical deployment of AI products for image analysis in histopathology grows, there is a need for transparency, rigorous validation, and clear evidence supporting clinical utility and cost-effectiveness. Independent scrutiny of the expanding offering of AI products provides insight into the opportunities and shortcomings in this domain.
Qiu, P.; An, Z.; Ha, S.; Kumar, S.; Yu, X.; Sotiras, A.
Multimodal medical image analysis exploits complementary information from multiple data sources (e.g., multi-contrast Magnetic Resonance Imaging (MRI), Diffusion Tensor Imaging (DTI), and Positron Emission Tomography (PET)) to enhance diagnostic accuracy and support clinical decision-making. Central to this process is the learning of robust representations that capture both modality-invariant and modality-specific features, which can then be leveraged for downstream tasks such as MRI segmentation and normative modeling of population-level variation and individual deviations. However, learning robust and generalizable representations becomes particularly challenging in the presence of missing modalities and heterogeneous data distributions. Most existing methods address this challenge primarily from a statistical perspective, yet they lack a theoretical understanding of the underlying geometric behavior, such as how probability mass is allocated across modalities. In this paper, we introduce a generalized geometric perspective for multimodal representation learning grounded in the concept of barycenters, which unifies a broad class of existing methods under a common theoretical perspective. Building on this barycentric formulation, we propose a novel approach that leverages generalized Wasserstein barycenters with hierarchical modality-specific priors to better preserve the geometry of unimodal distributions and enhance representation quality. We evaluated our framework on two key multimodal tasks, brain tumor MRI segmentation and normative modeling, demonstrating consistent improvements over a variety of multimodal approaches. Our results highlight the potential of scalable, theoretically grounded approaches to advance robust and generalizable representation learning in medical imaging applications.
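As a small illustration of what a barycenter of unimodal distributions looks like, the sketch below uses the standard closed form for the Wasserstein-2 barycenter of one-dimensional Gaussians, whose mean and standard deviation are the weighted averages of the inputs' means and standard deviations. It is not the authors' generalized formulation, and the comparison with a product-of-experts fusion is an illustrative choice that the abstract does not name; both function names are hypothetical.

```python
import math


def w2_barycenter_1d(means, stds, weights=None):
    """Wasserstein-2 barycenter of 1D Gaussians (closed form)."""
    if weights is None:
        weights = [1.0] * len(means)
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize to sum to one
    mean = sum(w * m for w, m in zip(weights, means))
    std = sum(w * s for w, s in zip(weights, stds))
    return mean, std


def product_of_experts_1d(means, stds):
    """Precision-weighted product of 1D Gaussians, shown for contrast."""
    precisions = [1.0 / (s * s) for s in stds]
    total = sum(precisions)
    mean = sum(m * p for m, p in zip(means, precisions)) / total
    return mean, math.sqrt(1.0 / total)


# Two hypothetical modality posteriors: N(0, 1^2) and N(4, 3^2).
bary_mean, bary_std = w2_barycenter_1d([0.0, 4.0], [1.0, 3.0])
poe_mean, poe_std = product_of_experts_1d([0.0, 4.0], [1.0, 3.0])
```

With these inputs the barycenter is centered between the two modalities with an intermediate spread, while the product of experts is pulled toward the sharper modality and is narrower than either input. A missing modality can be handled in this toy setting by simply omitting it from the input lists.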
Agumba, J.; Erick, S.; Pembere, A.; Nyongesa, J.
Objectives: To develop and evaluate a deployable deep learning system with Gradient-weighted Class Activation Mapping (Grad-CAM) for tuberculosis screening from chest radiographs and to assess its classification performance and explainability across desktop and mobile deployment platforms. Materials and methods: This study used publicly available chest X-ray datasets containing Normal and Tuberculosis images. A DenseNet121-based transfer learning model was trained using stratified training, validation, and test splits with data augmentation and class weighting. Model performance was evaluated using accuracy, precision, recall, F1 score, receiver operating characteristic (ROC) curve, and area under the ROC curve (AUC). Grad-CAM was used to visualize regions influencing model predictions. The trained model was converted to TensorFlow Lite and deployed in both a Windows desktop application and a Flutter-based mobile application for offline inference and visualization. Results: The model demonstrated strong classification performance on the independent test dataset, with high accuracy and AUC values indicating effective discrimination between Normal and Tuberculosis cases. Grad-CAM visualizations showed that the model focused primarily on anatomically relevant lung regions, particularly the upper and mid-lung fields in Tuberculosis cases. Deployment testing confirmed consistent prediction outputs and Grad-CAM visualizations across both Windows and mobile platforms. Conclusion: The proposed deployable deep learning system with Grad-CAM provides accurate and interpretable tuberculosis screening from chest radiographs and demonstrates feasibility for offline mobile and desktop deployment. This approach has potential as an artificial intelligence-assisted screening and decision support tool in radiology, particularly in resource-limited and remote healthcare settings.
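The core Grad-CAM computation mentioned above can be sketched in a few lines once the feature maps of a convolutional layer and the gradients of the class score with respect to them are available. In practice both come from the deep learning framework; here they are passed in as plain nested lists, and the function name and toy values are hypothetical.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from nested lists shaped [channels][height][width]."""
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of the class-score gradients.
    weights = [
        sum(sum(row) for row in grad) / (height * width) for grad in gradients
    ]
    # Weighted sum of feature maps over channels, followed by ReLU.
    cam = [
        [
            max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    # Scale to [0, 1] for display; an all-zero map is returned unchanged.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example: two channels on a 2x2 feature map.
acts = [[[1, 2], [3, 4]], [[1, 1], [1, 1]]]
grads = [[[1, 1], [1, 1]], [[-2, -2], [-2, -2]]]
heatmap = grad_cam(acts, grads)
```

The resulting low-resolution map is normally upsampled to the input image size and overlaid on the radiograph; that step depends on the imaging library in use and is omitted here.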
Bizjak, Z.; Zagar, J.; Spiclin, Z.
Automated and reliable image quality assessment (IQA) is essential for the safe use of medical image synthesis in critical applications like adaptive radiotherapy, treatment planning, or missing-modality reconstruction, where unnoticed generative artifacts may adversely affect outcomes. We evaluated image-to-image translation quality by coupling large-scale expert visual quality assessment with explainable automated IQA modeling. The adversarial diffusion-based framework SynDiff was applied to four cross-modality synthesis tasks, including three inter-MR translations and one CBCT-to-CT translation. Using four-fold cross-validation, ten reference-based and eight no-reference IQA metrics were computed for all synthesized images. Visual IQA ratings were independently collected from thirteen expert raters using a predetermined protocol and a specialized image viewer enabling blinded, randomized six-point Likert scoring. Auto-Sklearn was employed to learn ensemble regression models mapping IQA metrics to visual consensus ratings, with separate models trained on reference-based and no-reference metrics. The models closely reproduced the distribution and ordering of expert ratings, typically within ±0.5 Likert points. Reference-based models achieved higher agreement with visual ratings than no-reference models (R² 0.75 vs. 0.59, respectively), although the latter remained unbiased and informative. Explainability analyses highlighted structure- and contrast-sensitive metrics as key predictors. Overall, the results demonstrate that ensemble regression models can provide transparent, scalable, and clinically meaningful quality control for generative medical imaging.
Fenney, E.; Muralidharan, L.; Ruffle, J. K.; Pandit, A.; Millip, M.; Hammam, A.; Brookes, T.; Jabeen, F.; Colman, J.; Sarwani, O.; Alattar, K.; Efthymiou, E.; Kallam, N.; Siddiqui, J.; Marcus, H. J.; Nachev, P.; Hyare, H.
Background: Meningiomas are the most common primary intracranial tumors in adults, and volumetric assessment increasingly guides surveillance and treatment decisions. Automated segmentation could enable standardized volumetry but requires robust validation. Purpose: To develop a fully automated three-dimensional deep learning model for meningioma segmentation on multiparametric MRI, and to evaluate segmentation accuracy, external generalizability, failure modes, radiologist-rated clinical plausibility, and workflow feasibility. Methods: From 2024 to 2026, this retrospective study trained a custom 3D nnU-Net residual encoder model. Expert segmentations covered enhancing tumor (ET), tumor core (TC), and whole tumor (WT). Dice similarity coefficient (DSC) was the primary metric. External validation used an independent single-institution dataset (n = 310 intracranial cases) with incomplete MRI protocols. Failure modes, model equity, and inference time were assessed. A blinded multi-rater study (10 radiologists; 510 cases) rated TC segmentations using a 0-10 Likert scale, analyzed with linear mixed-effects models. Results: Model training used the BraTS Meningioma 2023 dataset (n = 1000; mean age 60.2 ± 14.5; 705 female). In cross-validation, mean DSC was 0.939 for ET, 0.937 for TC, and 0.921 for WT. In external validation, mean DSC was 0.872 for TC and 0.842 for WT, despite heterogeneous protocols and incomplete sequences. Predicted TC volumes correlated strongly with reference volumes in cross-validation (r = 0.995) and external validation (r = 0.971). The most common failure modes involved skull base and intraosseous tumors, and performance was equitable across demographic subgroups. Mean inference time was 1.2 seconds. In blinded evaluation (1120 ratings), model segmentations received higher scores than reference annotations (+0.32 BraTS; +1.38 external validation).
Conclusion: A fully automated deep-learning model achieved high meningioma segmentation accuracy across multi-institutional training data and external clinical imaging. In a blinded study, model segmentation quality exceeded reference annotations, and 1.2-second inference supported workflow integration. Prospective evaluation is warranted before routine deployment.
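Since the Dice similarity coefficient is the primary metric in the abstract above, a minimal sketch of its computation on binary masks may be useful. Masks are given here as flat sequences of 0/1 voxel labels; returning 1.0 when both masks are empty is an assumed convention, and the abstract does not say how such cases were handled.

```python
def dice_coefficient(pred, ref):
    """Dice similarity coefficient between two flat binary masks."""
    if len(pred) != len(ref):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labeled as foreground in both masks.
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Hypothetical four-voxel masks: prediction vs. reference.
predicted_mask = [1, 1, 0, 0]
reference_mask = [1, 0, 0, 0]
dsc = dice_coefficient(predicted_mask, reference_mask)
```

In this toy case one voxel overlaps out of three foreground voxels in total, so the DSC is 2/3. A 3D volume can be scored the same way after flattening it to one sequence per label (ET, TC, or WT).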